Multivariate epidemic count time series model

An infectious disease spreads not only over a single population or community but also across multiple and heterogeneous communities. Moreover, its transmissibility varies over time because of various factors such as seasonality and epidemic control, which results in strongly nonstationary behavior. In conventional methods for assessing transmissibility trends or changes, univariate time-varying reproduction numbers are calculated without taking into account transmission across multiple communities. In this paper, we propose a multivariate-count time series model for epidemics. We also propose a statistical method for estimating the transmission of infections across multiple communities and the time-varying reproduction numbers of each community simultaneously from a multivariate time series of case counts. We apply our method to incidence data for the novel coronavirus disease 2019 (COVID-19) pandemic to reveal the spatiotemporal heterogeneity of the epidemic process.


Introduction
Mathematical and statistical modeling of epidemics is crucial in epidemiology because it provides a theoretical basis for analyzing the spread of infectious diseases and the building blocks of statistical methodologies for data analysis [1,2]. Epidemic modeling has been used to describe a wide variety of phenomena. For instance, the spread of information, opinions, and social behaviors can be modeled as a contagion process [3]. Epidemic modeling lies at the core of a research field that crosses different disciplines. In this study, we developed a multivariate time-series epidemic analysis model.
During an ongoing pandemic, the most commonly available type of data is the daily (or weekly) number of newly reported cases. The time series of case counts provides information on the epidemic size and transmissibility trends. The time-varying effective reproduction number was defined as the expected number of secondary cases arising from a single primary case. They are widely used to monitor these trends. Recent developments in epidemiological data analysis have utilized statistical methodologies to improve the efficiency of estimating the effective reproduction number [4,5]. However, most existing methods are limited to univariate time-series analyses. Although epidemics spread across multiple communities, populations, or regions, these methods can only be used to calculate the univariate effective reproduction a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 numbers of each community separately. To incorporate infection transmission over multiple communities, it is necessary to extend these methods to perform multivariate analysis.
In this study, we propose a multivariate count time series model that describes an epidemic process in multiple nodes. Central to our approach is introducing latent variables to represent successive infections in transmission chains. These latent variables enable us to make posterior inferences regarding the transmission of infections across multiple nodes and develop an EMtype algorithm for estimating the model parameters. In particular, the proposed algorithm can be used to simultaneously estimate the entries in the adjacency matrix and time-varying effective reproduction numbers of each node from a multivariate incidence time series. An application of our method is demonstrated using synthetic and actual data from the COVID-19 pandemic.

Related works
The effective reproduction number can be defined in two approaches: instantaneous and cohort reproduction numbers. The former measures transmission at a specific point in time, while the latter measures transmission in a specific cohort of individuals [6,7]. In the following methods, either of these reproduction numbers are estimated from a time series of case counts.
Wallinga and Teunis's method is based on the likelihood of a renewal model and is commonly used to estimate the cohort reproduction number [8]. In the methods proposed by [9] (EpiEstim) and Bettencourt and Ribeiro [10], the instantaneous reproduction number is estimated using a Bayesian framework. The main difference between these two methods lies in the model assumption. A renewal model is assumed in EpiEstim, similar to the Wallinga and Teunis method, whereas the Bettencourt and Ribeiro method is based on the linearized growth rate of a SIR model. Compared to the SIR model, the renewal model involves simpler parametric assumptions regarding the epidemic process and requires only the generation interval distribution. An advantage of methods based on the renewal model is their simplicity: Parsimony reduces the risk of model misspecification when there are many unknowns in the underlying process. A comprehensive comparison of these three methods was presented by [7].
In a recently proposed method [4], the cohort reproduction number was estimated and combined with the state-space model in a renewal model, and the estimation problem was solved using a recursive Bayesian smoothing procedure. Parag [5] independently proposed a method to estimate the instantaneous reproduction number along the same lines. To the best of our knowledge, these are state-of-the-art methods for estimating the effective reproduction number from an incidence curve.
However, note that all these methods apply only to univariate time-series data.

Multivariate-count time series model
First, we consider a univariate epidemic model. Let n t be the number of cases in which symptoms begin at time t. Given an initial case at time t = 1, the rate of new cases at time t(� 2) can be described as where {ϕ τ } is the serial interval distribution (i.e., the distribution of time from symptom onset in the primary case to symptom onset in the secondary case [11]) and R t−τ is the effective reproduction number at time t − τ.
Before extending it to perform multivariate analysis, we contrast Eq (1) with the conventional renewal model that describes epidemic processes [5,6,9,12].
The most significant difference between these two models lies in their treatment of the effective reproduction number. Here, R t in Eq (2) represents the instantaneous reproduction number, while R t−τ in Eq (1) represents the cohort reproduction number, as shown in S1 File. Additionally, the count n t−τ in Eq (2) represents the number of cases infected at time t − τ; consequently, {ϕ τ } represents the generation time distribution. Therefore, to apply this model to reported cases, it is necessary to adjust the time lag between infection and symptom onset [6]. By contrast, because the case definition in Eq (1) is based on symptoms, there is no need to adjust for the time lag. Hence, we employ Eq (1) as the basis of our multivariate model. We consider an infection process that spreads over D nodes, where each node represents a group of individuals. Let n it be the number of newly reported cases in (i, t), where (i, t) denotes node i at time t. By extending Eq (1) to a multivariate setting, the rate of new cases at (i, t) is given by where R j,t−τ denotes the effective reproduction number at (j, t − τ) and a ij (� 0) represents the transmission ratio from node j to node i satisfying P D i¼1 a ij ¼ 1. Based on the history of previously reported cases, the count n it is assumed to follow a Poisson distribution with the rate Eq (3): The multivariate count time series model comprises the two components in Eqs (3) and (5).

Latent variable representing secondary infection in transmission chain
Now, let us introduce the latent variable y js it , which represents the number of secondary cases at (i, t) infected by the primary cases at (j, s) (s < t). As the total number of new cases at (i, t) is given by n it , the following equality holds: Assuming conditional independence between the transmission events y js it and y j 0 s 0 it for (j, s) 6 ¼ (j 0 , s 0 ) and given N 1:t−1 . The superposition principle can be applied to the Poisson distribution (5), leading to a Poisson distribution for counts y js it with the rate given by c js it ¼ a ij R js � tÀ s n js . Thus, given the sum of independent Poisson random variables n it , the conditional distribution of each element of the Poisson vector Y it � fy js it j j ¼ 1; . . . ; D; s ¼ 1; . . . ; t À 1g is multinomially distributed with count probabilities scaled by the sum of the individual rates: In particular, the conditional expectation of y js it , given n it and N 1:t−1 , is from which posterior inferences can be made regarding secondary infections from the reported incidences.

Parameter estimation
Using latent variables, we develop an expectation maximization (EM)-type algorithm to estimate the model parameters. In this study, we focus on estimating the weighted adjacency matrix A = (a ij ) and time-varying reproduction number R = {R js } of each node from an observed multivariate time series of incidence N 1:T . To estimate these parameters, we consider the following penalized log-likelihood function: where the choice of exponent p 2 {1, 2} and hyperparameter γ � 0 depend on the sparsity or smoothness of the variation in the time-varying reproduction numbers. Rather than maximizing Eq (9) with respect to A and R directly, we iteratively update the estimates to ensure that the objective function LðA; RÞ increases monotonically. To update the parameters, we construct a tight lower bound Q(A, R|A (k) , R (k) ) for the current parameter estimations {A (k) , R (k) } such that Maximizing the function that satisfies these properties ensures that the objective function increases monotonically. Similar to the EM algorithm, the lower bound is obtained as follows: where hy js it i ðkÞ is the conditional expectation of y js it computed from Eq (8) and the current parameter estimations {A (k) , R (k) }. (See S1 File for the derivation.) Updating of A The update for a ij is obtained by maximizing Eq (12) with respect to a ij under constraint P D i¼1 a ij ¼ 1. Using the Lagrange multiplier method, we obtain Eq (13) leads to a natural interpretation of a ij as the fraction of the expected number of secondary cases in node i infected by the primary cases in node j.
The transition density corresponding to the penalty function is given by a Laplace (p = 1) or Gaussian distribution (p = 2). The update of the time-varying reproduction number of node j, fR ðkþ1Þ js g T s¼1 , is then computed using state smoothing for the equivalent state-space model. The details of the smoothing algorithm are provided in S1 File.
The overall algorithm is summarized in Algorithm The free energy (negative log-marginal likelihood) of the state-space model can be computed through state smoothing (see S1 File for details). We use the total free energy, that is, the sum of the free energies of all the nodes, as the criterion for selecting the penalty function (p = 1 or 2) and the value of the hyperparameter γ.

Analysis of synthetic data
We first demonstrate our method using a toy example whereby the infection process is simulated using two nodes (D = 2). To mimic the COVID-19 pandemic, we employed a log-normal distribution for the serial interval distribution {ϕ t } with a mean and standard deviation of 4.7 and 2.9 days, respectively, [13]. The transmission ratios were set as a 11 = a 22 = 0.9 and a 12 = a 21 = 0.1. The time-varying reproduction numbers of each node are shown in Fig 1a. Using these parameters and a given initial infection at t = 1, the model given by Eqs (3) and (5) is used to generate N 1:T . In Fig 1b, we plot the simulated case counts N 1:T for each node (black lines). Because of the profile of the effective reproduction numbers, the incidence curves for both nodes exhibit two waves: the first and second are caused by nodes 1 and 2, respectively.

PLOS ONE
From the simulated case counts N 1:T , we set the penalty function (p = 1 or 2) and the value of the hyperparameter γ based on free energy minimization using a grid search and estimated the model parameters fÂ;Rg. The estimated transmission ratios areâ 11 ¼ 0:8930, a 22 ¼ 0:9047,â 12 ¼ 0:0953 andâ 21 ¼ 0:1070, which are in good agreement with the true values. Fig 1c shows the estimated time-varying reproduction numbers (blue line) of each node with 95% credible intervals (shaded area), along with the true values (yellow line), from which we confirmed that the amplitudes of the reproduction numbers were properly estimated.
Using the estimated parameters and Eq (8), the conditional expectation of the secondary infections, fhy js it ig, was computed, from which we estimated the number of infections that are transmitted from the other node, hy it i � P j6 ¼i P tÀ 1 s¼1 hy js it i (Fig 1b, gray area). As expected, we observed that the second wave in node 1 was dominated by infections transmitted from node 2, whereas the first wave in node 2 was initiated by the outbreak of node 1.
To examine the effect of overlooking inter-node infections on estimates of time-varying reproduction numbers, we estimated R with the identity matrix for A fixed, that is, separate estimates for each node (Fig 1d; 'OnlyR'). The estimated reproductive numbers are substantially biased. The reproduction number of node 1 during the first wave was underestimated because ignoring the infections node 1 transmitted to node 2, whereas the reproduction number during the second wave was overestimated because it counted the infections transmitted from node 2. (The same holds for the estimated reproduction number of node 2.) For comparison, we applied two other estimation methods to estimate the time-varying reproduction numbers. The first method is that proposed by Wallinga and Teunis ('WT'), which is widely used to estimate effective reproduction numbers [8]. In the second method [4], the effective reproduction number is estimated using the state-space method equipped with a Cauchy transition density ("Cauchy"). As these methods are applicable only to univariate time series, the time-varying reproduction numbers for each node are estimated separately. The estimation results obtained using these two methods are plotted in Fig 1e and 1f. As was the case with OnlyR, upward and downward biases were observed in these two methods because the inter-node infections were ignored. The difference in the estimated reproduction numbers among these three methods was attributed to the smoothness assumption made in these estimation methods. In particular, the WT is easily influenced by data fluctuations in the initial phase of the epidemic, and the estimated reproduction numbers decrease at the end of the recorded interval owing to right truncation.
We performed an additional numerical study, in which the number of nodes (dimensions) was varied from D = 10 to 50. The transmission ratios were randomly chosen and normalized such that P D i¼1 a ij ¼ 1, and the time-varying reproduction numbers of each node were randomly chosen between the two profiles, as shown in Fig 1a. To quantify the estimation performance, we compute the average relative error as follows: for time-varying reproduction numbers. We performed the numerical study 10 times with different samples and reported the average performance metrics over the 10 runs. The results are summarized in Table 1. Our method outperformed the other three methods over the range of dimensions examined. This confirmed that the estimation of time-varying reproductive numbers can be significantly improved by considering the transmission of infections across multiple nodes.

Analysis of actual data
We applied our method to actual data from the COVID-19 pandemic in Japan (https://www. mhlw.go.jp/stf/covid-19/open-data.html). The data consist of newly confirmed cases in D = 47 prefectures between January 16, 2020, and November 24, 2021, in which 1,720,441 cases were reported. The data for each week were aggregated to reduce the influence of daily noise on the reported cases. We applied our estimation method to new weekly cases to estimate the parameters fÂ;Rg. We empirically confirmed that the algorithms with different initializations converged to the same estimate. Using the estimated parameters and Eq (8), the conditional expectation of secondary infections, fhy js it ig, was computed, from which we made posterior inferences regarding the infections transmitted across the prefectures. In particular, we computed the expected total number of secondary infections in prefecture i that were transmitted from prefecture j: hy j i i � P T t¼1 P tÀ 1 s¼1 hy js it i. Overall, it was estimated that 82% of the infected cases ( P i hy i i i ¼ 1; 414; 104 cases) were infected within each prefecture ("intra-prefectural infections") and that 18% ( P i P j6 ¼i hy j i i ¼ 306; 337 cases) of the infections were transmitted across prefectures ("interprefectural infections"). Fig 2 shows a matrix visualization of ðhy j i iÞ (Fig 2a, heat map) and bar

PLOS ONE
graphs of the intra-prefectural infections hy i i i, in-degree hy i i � P j6 ¼i hy j i i, and out-degree hy j i � P i6 ¼j hy j i i in each prefecture (Fig 2b). The prefectures are arranged in geographical order from northeast to southwest. Inter-prefectural infections exhibit a community structure that is correlated with the demographic, economic, and industrial characteristics of the prefectures. The largest community was centered around Tokyo (the capital of Japan), the second-largest community around Osaka (the largest prefecture in the west), and the third-largest community was centered around Aichi (the largest industrial area). Fig 3 shows the weekly incidence curves (black lines) along with the estimated infection counts transmitted from the other nodes (gray areas) and estimated time-varying reproduction numbers (blue line) for Tokyo, Osaka, and Aichi. The estimated reproduction numbers exhibited rises and falls that are correlated with the periods whereby the state of emergency was implemented (purple range). In the same figure, we plot the time-varying reproduction numbers estimated using OnlyR (green line), WT (yellow line), and Cauchy (red line). These three estimates exhibit a systematic deviation from that obtained by our method, which indicates that there is a substantial number of inter-prefectural infections transmitted across prefectures. The estimated time-varying reproduction numbers for all the 47 prefectures are shown in S1 Fig.

Conclusion
In this paper, we proposed a multivariate-count time-series model for epidemics spreading across multiple nodes. The central concept of our approach is to introduce latent variables that represent secondary infections in transmission chains. This enabled us to infer infection transmission across multiple nodes and develop an EM-type algorithm for estimating model parameters. The proposed algorithm simultaneously estimates the weighted adjacency matrix and time-varying reproduction numbers for each node from a multivariate time series of incidence. In addition, we formulated a state-smoothing algorithm to estimate time-varying reproduction numbers. This enabled us to use a tool developed for state-space models.
Because the serial interval distribution is fixed, our estimation method is limited to incidence data whereby the statistical properties of the serial interval remain unchanged. While we employed the log-normal distribution estimated in the early phase of COVID-19 [13], the omicron variant, which became prevalent in January 2022, spreads more easily than the earlier variants of SARS-CoV-2 [14]. Therefore, it is necessary to revise the estimate of the serial interval distribution to assess the transmissibility during distinct phases of the pandemic.
Although we assume that the entire outbreak is driven by transmission within the network, the proposed model (3) can include case imports from outside the network as where μ it is the import rate in case (i, t). Accordingly, the cases imported from outside the network, denoted by y 0 it , are incorporated into Eq (6) as for which posterior inference based on the multinomial distribution (7) is applied to differentiate between cases arising from the network and those imported from outside. The rate of case importations, μ it , may be pre-estimated by medical inspection during airport quarantine; if this is not the case, simultaneous estimation of μ it , along with the adjacency matrix and timevarying reproduction numbers, will be developed, which is left for future work.
The concept of introducing latent variables to represent transmission chains is similar to that of the branching representation of the Hawkes process [15][16][17]. A Hawkes-type point process was obtained in the continuous-time limit of the count time-series model. From this perspective, our model can be regarded as its discrete-time counterpart and provides a basis for its application in the analysis of multivariate count time series.